Find MDCs associated with Medicaid and/or Private Insurance payer types.
Analyzing MDC codes from all admissions in HCUP NY SID 2006-2012.
All plots show the number of admissions by Major Diagnostic Category (MDC).
MDCs ordered in descending order of Medicaid admission counts
MDCs ordered in descending order of Medicaid admission proportion (of all admissions)
MDCs ordered in descending order of ratio of Medicaid:Private Insurance admissions
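The three orderings used in the plots can be sketched as follows. This is a minimal illustration only: the MDC labels, the counts, and the `other` payer column are hypothetical placeholders, not HCUP values, and the ratio assumes every MDC has at least one Private Insurance admission.

```python
# Hypothetical admission counts per MDC (invented for illustration, not HCUP data).
counts = {
    "MDC A": {"medicaid": 500, "private": 1000, "other": 500},
    "MDC B": {"medicaid": 300, "private": 100, "other": 100},
    "MDC C": {"medicaid": 400, "private": 200, "other": 1400},
}

def medicaid_count(mdc):
    """Raw number of Medicaid admissions for this MDC."""
    return counts[mdc]["medicaid"]

def medicaid_proportion(mdc):
    """Medicaid admissions as a share of all admissions for this MDC."""
    row = counts[mdc]
    return row["medicaid"] / sum(row.values())

def medicaid_private_ratio(mdc):
    """Medicaid:Private Insurance admission ratio (assumes private > 0)."""
    row = counts[mdc]
    return row["medicaid"] / row["private"]

# Each list holds the MDC labels in descending order of the chosen measure.
by_count = sorted(counts, key=medicaid_count, reverse=True)
by_proportion = sorted(counts, key=medicaid_proportion, reverse=True)
by_ratio = sorted(counts, key=medicaid_private_ratio, reverse=True)

print(by_count)
print(by_proportion)
print(by_ratio)
```

The three measures can rank the same MDCs quite differently: an MDC with many Medicaid admissions overall may still have a low Medicaid share if it is also large for other payers.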
K-means clustering classifies MDCs into k groups such that MDCs within the same cluster are as similar as possible, and MDCs from different clusters are as dissimilar as possible. For our data, similarity is represented by the number of discharges/admissions from each payer type.
K-means defines clusters by trying to minimize the total within-cluster variation. The standard algorithm (Hartigan-Wong (1979)) defines the within-cluster variation as the sum of squared Euclidean distances between each MDC and its corresponding cluster centroid:
\[W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2\]
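where \(x_i\) is an MDC (data point) belonging to cluster \(C_k\), and \(\mu_k\) is the centroid of \(C_k\), i.e. the mean of the MDCs assigned to that cluster.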
The algorithm tries to minimize the total within-cluster variation:
\[Total.Within.SS = \sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2\]
Here \(K\) is the number of clusters (written as k in the rest of this report).
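To make the two quantities concrete, here is a small pure-Python sketch that computes \(W(C_k)\) for each cluster and sums them into Total.Within.SS. The function names and the toy (Medicaid, Private Insurance) points are assumptions made for illustration; they are not values from the HCUP data.

```python
def centroid_of(points):
    """Centroid mu_k: the coordinate-wise mean of the points in one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def within_cluster_ss(points, centroid):
    """W(C_k): sum of squared Euclidean distances from each point to the centroid."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(point, centroid))
        for point in points
    )

def total_within_ss(clusters):
    """Total.Within.SS: W(C_k) summed over all clusters."""
    return sum(within_cluster_ss(c, centroid_of(c)) for c in clusters)

# Hypothetical clusters of (Medicaid, Private Insurance) admission counts.
clusters = [
    [(1.0, 2.0), (3.0, 4.0)],
    [(10.0, 10.0), (12.0, 14.0), (14.0, 12.0)],
]

for c in clusters:
    print(centroid_of(c), within_cluster_ss(c, centroid_of(c)))
print(total_within_ss(clusters))
```

In this toy example the first cluster has centroid (2, 3) and contributes 4, the second has centroid (12, 12) and contributes 16, so Total.Within.SS is 20.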
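The k-means algorithm can be summarized as:
1. Specify the number of clusters, k.
2. Select k MDCs at random as the initial cluster centroids.
3. Assign each MDC to its nearest centroid, based on Euclidean distance.
4. Update each centroid to the mean of the MDCs assigned to it.
5. Repeat steps 3 and 4 until the cluster assignments stop changing or a maximum number of iterations is reached.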
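As a rough illustration of the iteration, the sketch below implements the basic assign-and-update loop (Lloyd's algorithm) in pure Python. Note that this is the simpler textbook variant, not the Hartigan-Wong procedure named earlier. The function names, the `seed` and `max_iter` parameters, and the toy points are all assumptions made for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style k-means; returns (cluster index per point, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k data points as initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing, so the loop has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return assignment, centroids

# Hypothetical (Medicaid, Private Insurance) counts forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```

The cluster labels themselves (0 or 1) depend on the random initialization, so only the grouping is meaningful, not which group receives which label.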
We ran k-means clustering for \(k \in [2,15]\). Below are plots of the clusters for \(k \in [2,6]\). Cluster plots are usually projected onto the first two principal components, but here we are interested in two particular dimensions (Medicaid and Private Insurance admissions).
Recall that k-means defines clusters by minimizing the total within-cluster variation (Total.Within.SS). We can plot Total.Within.SS against the number of clusters k to decide the optimal number of clusters.
As k increases, Total.Within.SS approaches 0. A common heuristic is the “elbow method”: choose the value of k where the curve bends, the point beyond which adding more clusters yields diminishing returns in reducing Total.Within.SS.
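The elbow is usually read off the plot by eye, but one simple way to approximate it numerically is to look for the k where the drop in Total.Within.SS shrinks the most from one step to the next (the largest second difference). The sketch below assumes evenly spaced values of k; the function name and the Total.Within.SS values are hypothetical and are not the results of this analysis.

```python
# Hypothetical Total.Within.SS per k (invented values, not this report's results).
total_within_ss_by_k = {1: 1000.0, 2: 700.0, 3: 400.0, 4: 380.0, 5: 365.0, 6: 355.0}

def elbow_by_second_difference(wss_by_k):
    """Return the k where the reduction in Total.Within.SS falls off most sharply.

    Needs at least three values of k; returns None otherwise.
    """
    ks = sorted(wss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = wss_by_k[prev_k] - wss_by_k[k]  # gain from reaching k
        drop_after = wss_by_k[k] - wss_by_k[next_k]   # gain from going past k
        bend = drop_before - drop_after               # how sharply the curve bends at k
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

print(elbow_by_second_difference(total_within_ss_by_k))
```

With these made-up values the curve drops by 300 at each of the first two steps and by 20 or less afterwards, so the heuristic picks k=3. On noisier curves the largest second difference can disagree with a visual reading, so it is best treated as a cross-check rather than a rule.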
The above scree plot suggests that k=5 is the optimal number of clusters. However, recall that we are clustering MDCs based on the number of discharges per payer type, whereas we are only interested in finding subsets of MDCs that are more strongly associated with either Medicaid or Private Insurance.